import pandas as pd
import datetime
import os
os.chdir('C:\\Users\\toramanse\\Downloads')
df = pd.read_csv('all_ticks_long.csv')
df.info()
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2848030 entries, 0 to 2848029 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 short_name object 1 timestamp object 2 price float64 dtypes: float64(1), object(2) memory usage: 65.2+ MB
short_name 0 timestamp 0 price 0 dtype: int64
As timestamp is object, timestamp with datetime type is to be added as timestamp_dt.
df['timestamp_dt'] = pd.to_datetime(df['timestamp'])
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2848030 entries, 0 to 2848029 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 short_name object 1 timestamp object 2 price float64 3 timestamp_dt datetime64[ns, UTC] dtypes: datetime64[ns, UTC](1), float64(1), object(2) memory usage: 86.9+ MB
df.describe(include = 'all')
| short_name | timestamp | price | timestamp_dt | |
|---|---|---|---|---|
| count | 2848030 | 2848030 | 2.848030e+06 | 2848030 |
| unique | 60 | 50012 | NaN | NaN |
| top | GARAN | 2018-03-28T14:45:00Z | NaN | NaN |
| freq | 49308 | 60 | NaN | NaN |
| mean | NaN | NaN | 9.282380e+00 | 2016-04-26 08:05:50.160145664+00:00 |
| min | NaN | NaN | 0.000000e+00 | 2012-09-17 06:45:00+00:00 |
| 25% | NaN | NaN | 1.764200e+00 | 2014-08-28 07:00:00+00:00 |
| 50% | NaN | NaN | 4.427600e+00 | 2016-06-02 09:45:00+00:00 |
| 75% | NaN | NaN | 9.280000e+00 | 2018-01-05 07:15:00+00:00 |
| max | NaN | NaN | 1.394288e+02 | 2019-07-23 15:00:00+00:00 |
| std | NaN | NaN | 1.555997e+01 | NaN |
There are 60 stocks.\ Data have been collected between 2012-09-17 and 2019-07-23.\ Minimum of price requires to be more discovered more.
df_summ_byName = df.groupby(['short_name'])['price'].describe().sort_values(by = 'count', ascending = False)
print(df_summ_byName)
count mean std min 25% 50% \
short_name
GARAN 49308.0 7.899734 1.249637 0.0001 7.01540 7.6542
THYAO 49282.0 9.288821 4.027293 0.0001 6.43000 7.7800
YKBNK 49225.0 2.566327 0.422774 0.0001 2.26820 2.6093
ISCTR 49221.0 5.126551 1.003386 0.0001 4.32000 4.8543
VAKBN 49212.0 4.735438 0.977889 0.0001 4.03220 4.4742
AKBNK 49209.0 6.473105 0.944955 0.0001 5.85000 6.3057
PETKM 49184.0 2.539237 1.378510 0.0001 1.28690 2.2845
EREGL 49173.0 4.179544 2.690731 0.0001 2.18120 3.0360
KRDMD 49161.0 1.768390 0.940092 0.0001 1.08450 1.3979
TUPRS 49143.0 62.994535 32.398117 0.0001 34.54910 49.5542
TCELL 49143.0 9.828003 2.356250 0.0001 8.56630 9.7001
SAHOL 49095.0 8.615896 0.955310 0.0001 7.96520 8.6079
KCHOL 49093.0 12.248291 3.181444 0.0001 9.73680 12.0449
SISE 49090.0 3.048367 1.422848 0.0001 1.92200 2.6682
TTKOM 49077.0 5.660680 0.818598 0.0001 5.26730 5.7464
HALKB 49071.0 10.919353 3.071563 0.0001 8.72050 10.6531
GUBRF 49057.0 4.328323 1.222988 0.0001 3.27650 4.2500
ARCLK 49045.0 15.388088 4.531459 0.0001 11.71110 15.0100
FROTO 48995.0 32.763693 14.732664 0.0001 21.49380 27.1182
GOODY 48961.0 3.102485 0.886456 0.0001 2.42770 3.1920
TOASO 48946.0 16.597275 6.328241 0.0001 10.36560 16.5554
BRISA 48937.0 6.544896 1.295321 0.0001 5.89000 6.7300
TKFEN 48930.0 9.191809 6.667047 0.0001 4.31900 5.7532
MGROS 48903.0 19.576424 3.901269 0.0001 16.66000 19.1100
TRKCM 48886.0 2.027849 1.099667 0.0001 1.17420 1.6270
ZOREN 48807.0 1.248124 0.311330 0.0001 1.03380 1.2500
ASELS 48803.0 13.432535 9.624246 0.0001 4.94030 9.2757
OTKAR 48785.0 81.419528 27.782825 0.0001 56.77570 82.8224
VESTL 48781.0 5.942711 2.830465 0.0000 4.02000 6.3200
CCOLA 48749.0 36.890707 6.747213 0.0001 31.97820 34.8215
BAGFS 48650.0 10.407127 3.618058 0.0001 8.26175 10.6100
AKSA 48594.0 7.127504 2.710033 0.0001 5.20880 6.9853
KARSN 48527.0 1.326907 0.290413 0.0001 1.11000 1.2874
ECILC 48492.0 2.075865 0.973788 0.0001 1.17230 1.8214
PRKME 48466.0 2.927109 0.721949 0.0001 2.38950 2.7400
ASUZU 48433.0 6.467033 2.201036 0.0001 5.07480 5.9496
TSKB 48384.0 0.945233 0.155276 0.0001 0.82540 0.9373
ALARK 48335.0 2.060859 0.575943 0.0001 1.56890 1.9376
SODA 48276.0 3.189591 2.045764 0.0001 1.47580 2.6684
AKSEN 48171.0 3.183542 0.724332 0.0000 2.67000 2.9300
ANACM 48165.0 1.672102 0.788365 0.0001 1.04700 1.2597
AEFES 48131.0 20.982235 2.494002 0.0001 19.16050 20.6465
AYGAZ 48119.0 8.101948 2.610402 0.0001 5.95150 7.7238
BANVT 47951.0 7.628230 6.267278 0.0000 2.59000 3.7100
USAK 47659.0 1.220452 0.459532 0.0001 0.95710 1.0500
SASA 47633.0 2.294876 2.492934 0.0001 0.31920 0.7335
KRDMB 47532.0 2.222798 0.686385 0.0001 1.56120 2.2007
SKBNK 47270.0 1.473651 0.294908 0.0001 1.20000 1.5100
ALBRK 46862.0 1.365549 0.167824 1.0255 1.22510 1.3602
CEMAS 46394.0 1.209088 0.799981 0.0000 0.70000 0.8700
YATAS 46055.0 2.434249 2.552377 0.0001 0.38860 0.9658
KAREL 46032.0 3.178023 2.133619 0.0001 1.53130 1.8200
TUKAS 45929.0 1.737529 0.867095 0.6500 1.06000 1.5300
YUNSA 45528.0 4.079695 1.347020 0.0001 3.00670 4.1078
PARSN 45325.0 8.276989 4.662471 0.0000 4.57000 7.8900
PGSUS 45221.0 24.789487 7.656535 0.0000 17.79000 25.6400
ICBCT 44336.0 2.828502 1.789883 0.0000 1.55960 2.0300
ISYAT 43184.0 0.537338 0.160246 0.0001 0.44120 0.4957
ISFIN 42877.0 1.559420 1.764839 0.0001 0.56390 0.8635
ISDMR 12227.0 5.351663 1.697918 1.0181 4.85420 5.9063
75% max
short_name
GARAN 8.67860 12.1554
THYAO 12.27000 19.9500
YKBNK 2.87400 3.9581
ISCTR 5.82030 7.9639
VAKBN 5.24600 7.5814
AKBNK 6.93250 9.2124
PETKM 3.88280 5.7697
EREGL 6.75870 10.4710
KRDMD 2.16900 4.9510
TUPRS 93.42870 139.2937
TCELL 11.23640 15.8125
SAHOL 9.26820 11.6826
KCHOL 15.16930 19.1500
SISE 4.14600 6.9230
TTKOM 6.26000 7.3500
HALKB 13.49090 20.2365
GUBRF 5.13000 13.6191
ARCLK 19.08770 26.4278
FROTO 48.51160 65.4192
GOODY 3.59660 58.7574
TOASO 20.65130 29.9218
BRISA 7.33000 10.3275
TKFEN 14.24675 27.3200
MGROS 22.10000 30.2600
TRKCM 2.98260 4.6432
ZOREN 1.42650 2.4430
ASELS 22.75670 46.7616
OTKAR 105.49880 139.4288
VESTL 7.45000 14.5400
CCOLA 42.04970 54.2208
BAGFS 12.35000 38.4352
AKSA 8.72000 15.1189
KARSN 1.47000 2.5000
ECILC 2.78090 4.2278
PRKME 3.43650 5.4300
ASUZU 7.12000 15.2800
TSKB 1.02440 1.4208
ALARK 2.42140 3.5143
SODA 4.28610 7.7659
AKSEN 3.75000 5.1900
ANACM 2.40210 3.5021
AEFES 22.73200 28.5090
AYGAZ 10.26900 13.5935
BANVT 11.93000 28.6800
USAK 1.37080 2.7578
SASA 4.94730 8.4260
KRDMB 2.72730 4.4960
SKBNK 1.72070 2.2516
ALBRK 1.50000 2.1900
CEMAS 1.50000 7.0100
YATAS 4.23000 10.6748
KAREL 5.25000 9.4600
TUKAS 2.13000 5.9200
YUNSA 4.72060 9.5275
PARSN 10.65000 29.8200
PGSUS 29.44000 50.6500
ICBCT 4.07000 11.2700
ISYAT 0.63330 1.1500
ISFIN 1.67420 9.8300
ISDMR 6.56070 7.5936
It seems that minimum value is an issue. Values of 0.0000 and 0.0001 must be checked.\ As stocks have different number of counts, wide format will have got null values. Null values will be handled once wide data is to be used for PCA.
chk_1 = (df['short_name'] == 'GARAN') & (df['price'] == 0.0001)
print(df[chk_1])
short_name timestamp price timestamp_dt 923352 GARAN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
Let's see what happened to GARAN on the same and following days.
print(df.loc[923351:923353])
short_name timestamp price timestamp_dt 923351 GARAN 2013-05-06T14:30:00Z 8.8245 2013-05-06 14:30:00+00:00 923352 GARAN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00 923353 GARAN 2013-05-07T06:30:00Z 9.3782 2013-05-07 06:30:00+00:00
Since the values before/after 0.0001 are nowhere nearby, it is obviously an error. Let's see all problematic rows. These values may be removed from the dataset.
chk_2 = (df['price'] <= 0.0001)
print(len(df[chk_2]))
print(df[chk_2])
57
short_name timestamp price timestamp_dt
3677 AEFES 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
52255 AKBNK 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
101192 AKSA 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
149396 AKSEN 2013-05-06T16:45:00Z 0.0000 2013-05-06 16:45:00+00:00
197755 ALARK 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
292904 ANACM 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
341494 ARCLK 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
390625 ASELS 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
439340 ASUZU 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
487598 AYGAZ 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
535755 BAGFS 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
584341 BANVT 2013-05-06T16:45:00Z 0.0000 2013-05-06 16:45:00+00:00
632516 BRISA 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
681225 CCOLA 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
730061 CEMAS 2013-05-06T16:45:00Z 0.0000 2013-05-06 16:45:00+00:00
776443 ECILC 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
825137 EREGL 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
874182 FROTO 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
923352 GARAN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
972535 GOODY 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1021506 GUBRF 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1070479 HALKB 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1119276 ICBCT 2013-05-06T16:45:00Z 0.0000 2013-05-06 16:45:00+00:00
1164077 ISCTR 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1224868 ISFIN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1267470 ISYAT 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1310861 KAREL 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1357097 KARSN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1406040 KCHOL 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1454773 KRDMB 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1502645 KRDMD 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1551768 MGROS 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1600490 OTKAR 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1648851 PARSN 2013-05-06T16:45:00Z 0.0000 2013-05-06 16:45:00+00:00
1694902 PETKM 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1740153 PGSUS 2013-05-06T16:45:00Z 0.0000 2013-05-06 16:45:00+00:00
1788960 PRKME 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1837687 SAHOL 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1886501 SASA 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1934464 SISE 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
1983221 SKBNK 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2030487 SODA 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2079107 TCELL 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2128293 THYAO 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2177326 TKFEN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2226320 TOASO 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2275288 TRKCM 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2323912 TSKB 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2372619 TTKOM 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2467658 TUPRS 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2516586 USAK 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2564510 VAKBN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2613379 VESTL 2013-05-06T16:45:00Z 0.0000 2013-05-06 16:45:00+00:00
2661987 YATAS 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2708568 YKBNK 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2756506 YUNSA 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
2803110 ZOREN 2013-05-06T16:45:00Z 0.0001 2013-05-06 16:45:00+00:00
These incorrect updates are to be removed. Summary of the data will be updated after removing so that it can be observed whether the error continues with some other so small values.
df = df[df['price'] > 0.0001]
df_summ_byName = df.groupby(['short_name'])['price'].describe().sort_values(by = 'count', ascending = False)
print(df_summ_byName)
count mean std min 25% 50% \
short_name
GARAN 49307.0 7.899894 1.249143 5.0936 7.015400 7.65420
THYAO 49281.0 9.289010 4.027117 3.2356 6.430000 7.78000
YKBNK 49224.0 2.566379 0.422620 1.5300 2.268200 2.60930
ISCTR 49220.0 5.126655 1.003131 3.4441 4.320000 4.85430
VAKBN 49211.0 4.735534 0.977666 3.0000 4.032200 4.47420
AKBNK 49208.0 6.473236 0.944514 4.3227 5.850000 6.30570
PETKM 49183.0 2.539289 1.378477 0.7379 1.286900 2.28450
EREGL 49172.0 4.179629 2.690693 0.7289 2.181200 3.03600
KRDMD 49160.0 1.768426 0.940068 0.7042 1.084500 1.39790
TUPRS 49142.0 62.995817 32.397201 24.1844 34.549100 49.55420
TCELL 49142.0 9.828203 2.355857 4.3504 8.566300 9.70010
SAHOL 49094.0 8.616071 0.954528 6.4824 7.965200 8.60790
KCHOL 49092.0 12.248541 3.180996 5.9145 9.736800 12.04490
SISE 49089.0 3.048430 1.422796 0.8881 1.922000 2.66820
TTKOM 49076.0 5.660795 0.818208 3.1900 5.267300 5.74640
HALKB 49070.0 10.919575 3.071199 4.9100 8.720500 10.65450
GUBRF 49056.0 4.328412 1.222844 2.2561 3.279125 4.25000
ARCLK 49044.0 15.388402 4.530973 6.0608 11.711100 15.01500
FROTO 48994.0 32.764362 14.732071 10.9305 21.493800 27.11820
GOODY 48960.0 3.102549 0.886355 1.5019 2.427700 3.19245
TOASO 48945.0 16.597614 6.327861 5.0917 10.365600 16.55540
BRISA 48936.0 6.545030 1.294996 3.2433 5.890000 6.73000
TKFEN 48929.0 9.191996 6.666985 3.0316 4.319000 5.75320
MGROS 48902.0 19.576824 3.900305 11.2500 16.660000 19.11000
TRKCM 48885.0 2.027891 1.099640 0.4006 1.174200 1.62700
ZOREN 48806.0 1.248150 0.311282 0.6302 1.033800 1.25000
ASELS 48802.0 13.432811 9.624152 2.9009 4.940300 9.27570
OTKAR 48784.0 81.421197 27.780664 24.1387 56.775700 82.82240
VESTL 48780.0 5.942833 2.830366 1.3300 4.020000 6.32000
CCOLA 48748.0 36.891464 6.745213 23.7252 31.978200 34.82150
BAGFS 48649.0 10.407341 3.617788 4.6300 8.267000 10.61000
AKSA 48593.0 7.127651 2.709868 1.5268 5.208800 6.98530
KARSN 48526.0 1.326934 0.290354 0.7862 1.110000 1.28740
ECILC 48491.0 2.075908 0.973753 0.7669 1.172300 1.82140
PRKME 48465.0 2.927169 0.721834 1.8300 2.389500 2.74000
ASUZU 48432.0 6.467167 2.200862 2.3277 5.074800 5.94960
TSKB 48383.0 0.945252 0.155218 0.6400 0.825400 0.93730
ALARK 48334.0 2.060902 0.575873 1.1880 1.568900 1.93760
SODA 48275.0 3.189657 2.045734 0.2447 1.475800 2.66840
AKSEN 48170.0 3.183608 0.724194 2.0100 2.670000 2.93000
ANACM 48164.0 1.672136 0.788336 0.7514 1.047000 1.25970
AEFES 48130.0 20.982671 2.492193 13.4782 19.160500 20.64650
AYGAZ 48118.0 8.102116 2.610168 3.2193 5.951500 7.72380
BANVT 47950.0 7.628389 6.267246 1.7800 2.590000 3.71000
USAK 47658.0 1.220477 0.459503 0.5353 0.957100 1.05000
SASA 47632.0 2.294924 2.492938 0.1820 0.319200 0.73350
KRDMB 47531.0 2.222844 0.686316 1.1997 1.561200 2.20070
SKBNK 47269.0 1.473682 0.294833 0.8300 1.200000 1.51000
ALBRK 46862.0 1.365549 0.167824 1.0255 1.225100 1.36020
CEMAS 46393.0 1.209114 0.799970 0.5000 0.700000 0.87000
YATAS 46054.0 2.434302 2.552380 0.2029 0.388600 0.96580
KAREL 46031.0 3.178092 2.133591 1.1200 1.531300 1.82000
TUKAS 45929.0 1.737529 0.867095 0.6500 1.060000 1.53000
YUNSA 45527.0 4.079785 1.346899 2.0874 3.006700 4.10780
PARSN 45324.0 8.277171 4.662360 2.2500 4.570000 7.89000
PGSUS 45220.0 24.790035 7.655733 11.0900 17.790000 25.64000
ICBCT 44335.0 2.828566 1.789853 0.6800 1.559600 2.03000
ISYAT 43183.0 0.537350 0.160227 0.1984 0.441200 0.49570
ISFIN 42876.0 1.559456 1.764844 0.2369 0.563900 0.86350
ISDMR 12227.0 5.351663 1.697918 1.0181 4.854200 5.90630
75% max
short_name
GARAN 8.678600 12.1554
THYAO 12.270000 19.9500
YKBNK 2.874000 3.9581
ISCTR 5.820300 7.9639
VAKBN 5.246000 7.5814
AKBNK 6.932500 9.2124
PETKM 3.882800 5.7697
EREGL 6.758700 10.4710
KRDMD 2.169000 4.9510
TUPRS 93.428700 139.2937
TCELL 11.236400 15.8125
SAHOL 9.268600 11.6826
KCHOL 15.169300 19.1500
SISE 4.146000 6.9230
TTKOM 6.260000 7.3500
HALKB 13.490900 20.2365
GUBRF 5.130000 13.6191
ARCLK 19.087700 26.4278
FROTO 48.511600 65.4192
GOODY 3.596600 58.7574
TOASO 20.651300 29.9218
BRISA 7.330000 10.3275
TKFEN 14.248900 27.3200
MGROS 22.100000 30.2600
TRKCM 2.982600 4.6432
ZOREN 1.426500 2.4430
ASELS 22.756700 46.7616
OTKAR 105.498800 139.4288
VESTL 7.450000 14.5400
CCOLA 42.056525 54.2208
BAGFS 12.350000 38.4352
AKSA 8.720000 15.1189
KARSN 1.470000 2.5000
ECILC 2.780900 4.2278
PRKME 3.436500 5.4300
ASUZU 7.120000 15.2800
TSKB 1.024400 1.4208
ALARK 2.421400 3.5143
SODA 4.286100 7.7659
AKSEN 3.750000 5.1900
ANACM 2.402100 3.5021
AEFES 22.732000 28.5090
AYGAZ 10.269000 13.5935
BANVT 11.930000 28.6800
USAK 1.370800 2.7578
SASA 4.947300 8.4260
KRDMB 2.727300 4.4960
SKBNK 1.720700 2.2516
ALBRK 1.500000 2.1900
CEMAS 1.500000 7.0100
YATAS 4.230000 10.6748
KAREL 5.250000 9.4600
TUKAS 2.130000 5.9200
YUNSA 4.720600 9.5275
PARSN 10.650000 29.8200
PGSUS 29.440000 50.6500
ICBCT 4.070000 11.2700
ISYAT 0.633300 1.1500
ISFIN 1.674200 9.8300
ISDMR 6.560700 7.5936
Let's see the plot of time series to check if any unusual fluctuation exists.
import matplotlib.pyplot as plt
nrow=10
ncol=6
fig, axes = plt.subplots(nrow, ncol, figsize = (30, 40))
cnt=0
for r in range(nrow):
for c in range(ncol):
df.groupby(df[df['short_name'] == df_summ_byName.index[cnt]]['timestamp_dt'])['price'].mean().plot(ax=axes[r,c])
axes[r,c].set_ylabel(df_summ_byName.index[cnt])
cnt+=1
GUBRF has an unusual jump at the end of 2014.\ GOODY has an unusual "era" in the middle of 2016.\ Let's analyse these stocks and relevant dates.\ Maximum values from summary table would benefit to spot the date. GUBRF: 13.6191 & GOODY: 58.7574
chk_3 = (df['short_name'] == 'GUBRF') & (df['price'] == 13.6191)
chk_4 = (df['short_name'] == 'GOODY') & (df['price'] == 58.7574)
print(df[chk_3])
print(df[chk_4])
short_name timestamp price timestamp_dt
1025255 GUBRF 2013-12-05T15:30:00Z 13.6191 2013-12-05 15:30:00+00:00
short_name timestamp price timestamp_dt
987040 GOODY 2015-07-15T12:00:00Z 58.7574 2015-07-15 12:00:00+00:00
Let's check the values before/after.
print(df.loc[1025254:1025256])
print(df.loc[987039:987041])
short_name timestamp price timestamp_dt
1025254 GUBRF 2013-12-05T15:15:00Z 3.4025 2013-12-05 15:15:00+00:00
1025255 GUBRF 2013-12-05T15:30:00Z 13.6191 2013-12-05 15:30:00+00:00
1025256 GUBRF 2013-12-06T07:30:00Z 3.3933 2013-12-06 07:30:00+00:00
short_name timestamp price timestamp_dt
987039 GOODY 2015-07-15T11:45:00Z 2.4886 2015-07-15 11:45:00+00:00
987040 GOODY 2015-07-15T12:00:00Z 58.7574 2015-07-15 12:00:00+00:00
987041 GOODY 2015-07-15T12:15:00Z 2.4808 2015-07-15 12:15:00+00:00
Mentioned maximum values are likely to be errors. They will be removed in the next step and then summary table will be updated.
df.drop(1025255, inplace = True)
df.drop(987040, inplace = True)
df_summ_byName = df.groupby(['short_name'])['price'].describe().sort_values(by = 'count', ascending = False)
print(df_summ_byName)
count mean std min 25% 50% \
short_name
GARAN 49307.0 7.899894 1.249143 5.0936 7.01540 7.6542
THYAO 49281.0 9.289010 4.027117 3.2356 6.43000 7.7800
YKBNK 49224.0 2.566379 0.422620 1.5300 2.26820 2.6093
ISCTR 49220.0 5.126655 1.003131 3.4441 4.32000 4.8543
VAKBN 49211.0 4.735534 0.977666 3.0000 4.03220 4.4742
AKBNK 49208.0 6.473236 0.944514 4.3227 5.85000 6.3057
PETKM 49183.0 2.539289 1.378477 0.7379 1.28690 2.2845
EREGL 49172.0 4.179629 2.690693 0.7289 2.18120 3.0360
KRDMD 49160.0 1.768426 0.940068 0.7042 1.08450 1.3979
TUPRS 49142.0 62.995817 32.397201 24.1844 34.54910 49.5542
TCELL 49142.0 9.828203 2.355857 4.3504 8.56630 9.7001
SAHOL 49094.0 8.616071 0.954528 6.4824 7.96520 8.6079
KCHOL 49092.0 12.248541 3.180996 5.9145 9.73680 12.0449
SISE 49089.0 3.048430 1.422796 0.8881 1.92200 2.6682
TTKOM 49076.0 5.660795 0.818208 3.1900 5.26730 5.7464
HALKB 49070.0 10.919575 3.071199 4.9100 8.72050 10.6545
GUBRF 49055.0 4.328222 1.222137 2.2561 3.27825 4.2500
ARCLK 49044.0 15.388402 4.530973 6.0608 11.71110 15.0150
FROTO 48994.0 32.764362 14.732071 10.9305 21.49380 27.1182
GOODY 48959.0 3.101412 0.849924 1.5019 2.42770 3.1920
TOASO 48945.0 16.597614 6.327861 5.0917 10.36560 16.5554
BRISA 48936.0 6.545030 1.294996 3.2433 5.89000 6.7300
TKFEN 48929.0 9.191996 6.666985 3.0316 4.31900 5.7532
MGROS 48902.0 19.576824 3.900305 11.2500 16.66000 19.1100
TRKCM 48885.0 2.027891 1.099640 0.4006 1.17420 1.6270
ZOREN 48806.0 1.248150 0.311282 0.6302 1.03380 1.2500
ASELS 48802.0 13.432811 9.624152 2.9009 4.94030 9.2757
OTKAR 48784.0 81.421197 27.780664 24.1387 56.77570 82.8224
VESTL 48780.0 5.942833 2.830366 1.3300 4.02000 6.3200
CCOLA 48748.0 36.891464 6.745213 23.7252 31.97820 34.8215
BAGFS 48649.0 10.407341 3.617788 4.6300 8.26700 10.6100
AKSA 48593.0 7.127651 2.709868 1.5268 5.20880 6.9853
KARSN 48526.0 1.326934 0.290354 0.7862 1.11000 1.2874
ECILC 48491.0 2.075908 0.973753 0.7669 1.17230 1.8214
PRKME 48465.0 2.927169 0.721834 1.8300 2.38950 2.7400
ASUZU 48432.0 6.467167 2.200862 2.3277 5.07480 5.9496
TSKB 48383.0 0.945252 0.155218 0.6400 0.82540 0.9373
ALARK 48334.0 2.060902 0.575873 1.1880 1.56890 1.9376
SODA 48275.0 3.189657 2.045734 0.2447 1.47580 2.6684
AKSEN 48170.0 3.183608 0.724194 2.0100 2.67000 2.9300
ANACM 48164.0 1.672136 0.788336 0.7514 1.04700 1.2597
AEFES 48130.0 20.982671 2.492193 13.4782 19.16050 20.6465
AYGAZ 48118.0 8.102116 2.610168 3.2193 5.95150 7.7238
BANVT 47950.0 7.628389 6.267246 1.7800 2.59000 3.7100
USAK 47658.0 1.220477 0.459503 0.5353 0.95710 1.0500
SASA 47632.0 2.294924 2.492938 0.1820 0.31920 0.7335
KRDMB 47531.0 2.222844 0.686316 1.1997 1.56120 2.2007
SKBNK 47269.0 1.473682 0.294833 0.8300 1.20000 1.5100
ALBRK 46862.0 1.365549 0.167824 1.0255 1.22510 1.3602
CEMAS 46393.0 1.209114 0.799970 0.5000 0.70000 0.8700
YATAS 46054.0 2.434302 2.552380 0.2029 0.38860 0.9658
KAREL 46031.0 3.178092 2.133591 1.1200 1.53130 1.8200
TUKAS 45929.0 1.737529 0.867095 0.6500 1.06000 1.5300
YUNSA 45527.0 4.079785 1.346899 2.0874 3.00670 4.1078
PARSN 45324.0 8.277171 4.662360 2.2500 4.57000 7.8900
PGSUS 45220.0 24.790035 7.655733 11.0900 17.79000 25.6400
ICBCT 44335.0 2.828566 1.789853 0.6800 1.55960 2.0300
ISYAT 43183.0 0.537350 0.160227 0.1984 0.44120 0.4957
ISFIN 42876.0 1.559456 1.764844 0.2369 0.56390 0.8635
ISDMR 12227.0 5.351663 1.697918 1.0181 4.85420 5.9063
75% max
short_name
GARAN 8.678600 12.1554
THYAO 12.270000 19.9500
YKBNK 2.874000 3.9581
ISCTR 5.820300 7.9639
VAKBN 5.246000 7.5814
AKBNK 6.932500 9.2124
PETKM 3.882800 5.7697
EREGL 6.758700 10.4710
KRDMD 2.169000 4.9510
TUPRS 93.428700 139.2937
TCELL 11.236400 15.8125
SAHOL 9.268600 11.6826
KCHOL 15.169300 19.1500
SISE 4.146000 6.9230
TTKOM 6.260000 7.3500
HALKB 13.490900 20.2365
GUBRF 5.130000 8.6743
ARCLK 19.087700 26.4278
FROTO 48.511600 65.4192
GOODY 3.596600 58.2892
TOASO 20.651300 29.9218
BRISA 7.330000 10.3275
TKFEN 14.248900 27.3200
MGROS 22.100000 30.2600
TRKCM 2.982600 4.6432
ZOREN 1.426500 2.4430
ASELS 22.756700 46.7616
OTKAR 105.498800 139.4288
VESTL 7.450000 14.5400
CCOLA 42.056525 54.2208
BAGFS 12.350000 38.4352
AKSA 8.720000 15.1189
KARSN 1.470000 2.5000
ECILC 2.780900 4.2278
PRKME 3.436500 5.4300
ASUZU 7.120000 15.2800
TSKB 1.024400 1.4208
ALARK 2.421400 3.5143
SODA 4.286100 7.7659
AKSEN 3.750000 5.1900
ANACM 2.402100 3.5021
AEFES 22.732000 28.5090
AYGAZ 10.269000 13.5935
BANVT 11.930000 28.6800
USAK 1.370800 2.7578
SASA 4.947300 8.4260
KRDMB 2.727300 4.4960
SKBNK 1.720700 2.2516
ALBRK 1.500000 2.1900
CEMAS 1.500000 7.0100
YATAS 4.230000 10.6748
KAREL 5.250000 9.4600
TUKAS 2.130000 5.9200
YUNSA 4.720600 9.5275
PARSN 10.650000 29.8200
PGSUS 29.440000 50.6500
ICBCT 4.070000 11.2700
ISYAT 0.633300 1.1500
ISFIN 1.674200 9.8300
ISDMR 6.560700 7.5936
The comparison between mean and median implies that most of the distributions are skewed. Let's see KDEs with diverse bandwidths.
nrow=10
ncol=6
fig, axes = plt.subplots(nrow, ncol, figsize = (30, 40))
cnt=0
for r in range(nrow):
for c in range(ncol):
df[df['short_name'] == df_summ_byName.index[cnt]]['price'].plot.kde(ax=axes[r,c])
df[df['short_name'] == df_summ_byName.index[cnt]]['price'].plot.kde(bw_method=.3, ax=axes[r,c])
df[df['short_name'] == df_summ_byName.index[cnt]]['price'].plot.kde(bw_method=.5, ax=axes[r,c])
df[df['short_name'] == df_summ_byName.index[cnt]]['price'].plot.kde(bw_method=1, ax=axes[r,c])
axes[r,c].set_ylabel(df_summ_byName.index[cnt])
cnt+=1
Most of the distributions is not normally distributed. That is why, it would be better to use median for central tendency if ever needed.
Dataset should be transformed to wide format.
df_wide = pd.pivot(df,index = 'timestamp_dt', columns = 'short_name', values = 'price')
df_wide.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 50011 entries, 2012-09-17 06:45:00+00:00 to 2019-07-23 15:00:00+00:00 Data columns (total 60 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AEFES 48130 non-null float64 1 AKBNK 49208 non-null float64 2 AKSA 48593 non-null float64 3 AKSEN 48170 non-null float64 4 ALARK 48334 non-null float64 5 ALBRK 46862 non-null float64 6 ANACM 48164 non-null float64 7 ARCLK 49044 non-null float64 8 ASELS 48802 non-null float64 9 ASUZU 48432 non-null float64 10 AYGAZ 48118 non-null float64 11 BAGFS 48649 non-null float64 12 BANVT 47950 non-null float64 13 BRISA 48936 non-null float64 14 CCOLA 48748 non-null float64 15 CEMAS 46393 non-null float64 16 ECILC 48491 non-null float64 17 EREGL 49172 non-null float64 18 FROTO 48994 non-null float64 19 GARAN 49307 non-null float64 20 GOODY 48960 non-null float64 21 GUBRF 49056 non-null float64 22 HALKB 49070 non-null float64 23 ICBCT 44335 non-null float64 24 ISCTR 49220 non-null float64 25 ISDMR 12227 non-null float64 26 ISFIN 42876 non-null float64 27 ISYAT 43183 non-null float64 28 KAREL 46031 non-null float64 29 KARSN 48526 non-null float64 30 KCHOL 49092 non-null float64 31 KRDMB 47531 non-null float64 32 KRDMD 49160 non-null float64 33 MGROS 48902 non-null float64 34 OTKAR 48784 non-null float64 35 PARSN 45324 non-null float64 36 PETKM 49183 non-null float64 37 PGSUS 45220 non-null float64 38 PRKME 48465 non-null float64 39 SAHOL 49094 non-null float64 40 SASA 47632 non-null float64 41 SISE 49089 non-null float64 42 SKBNK 47269 non-null float64 43 SODA 48275 non-null float64 44 TCELL 49142 non-null float64 45 THYAO 49281 non-null float64 46 TKFEN 48929 non-null float64 47 TOASO 48945 non-null float64 48 TRKCM 48885 non-null float64 49 TSKB 48383 non-null float64 50 TTKOM 49076 non-null float64 51 TUKAS 45929 non-null float64 52 TUPRS 49142 non-null float64 53 USAK 47658 non-null float64 54 VAKBN 49211 non-null float64 55 VESTL 48780 non-null float64 56 YATAS 46054 non-null float64 57 YKBNK 49224 non-null float64 58 YUNSA 45527 non-null float64 59 ZOREN 48806 non-null float64 dtypes: float64(60) memory usage: 23.3 MB
There are null values as expected. These values will be handled in 2 steps as follows.
df_wide.fillna(method = 'bfill', inplace = True)
df_wide.fillna(method = 'ffill', inplace = True)
df_wide.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 50011 entries, 2012-09-17 06:45:00+00:00 to 2019-07-23 15:00:00+00:00 Data columns (total 60 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AEFES 50011 non-null float64 1 AKBNK 50011 non-null float64 2 AKSA 50011 non-null float64 3 AKSEN 50011 non-null float64 4 ALARK 50011 non-null float64 5 ALBRK 50011 non-null float64 6 ANACM 50011 non-null float64 7 ARCLK 50011 non-null float64 8 ASELS 50011 non-null float64 9 ASUZU 50011 non-null float64 10 AYGAZ 50011 non-null float64 11 BAGFS 50011 non-null float64 12 BANVT 50011 non-null float64 13 BRISA 50011 non-null float64 14 CCOLA 50011 non-null float64 15 CEMAS 50011 non-null float64 16 ECILC 50011 non-null float64 17 EREGL 50011 non-null float64 18 FROTO 50011 non-null float64 19 GARAN 50011 non-null float64 20 GOODY 50011 non-null float64 21 GUBRF 50011 non-null float64 22 HALKB 50011 non-null float64 23 ICBCT 50011 non-null float64 24 ISCTR 50011 non-null float64 25 ISDMR 50011 non-null float64 26 ISFIN 50011 non-null float64 27 ISYAT 50011 non-null float64 28 KAREL 50011 non-null float64 29 KARSN 50011 non-null float64 30 KCHOL 50011 non-null float64 31 KRDMB 50011 non-null float64 32 KRDMD 50011 non-null float64 33 MGROS 50011 non-null float64 34 OTKAR 50011 non-null float64 35 PARSN 50011 non-null float64 36 PETKM 50011 non-null float64 37 PGSUS 50011 non-null float64 38 PRKME 50011 non-null float64 39 SAHOL 50011 non-null float64 40 SASA 50011 non-null float64 41 SISE 50011 non-null float64 42 SKBNK 50011 non-null float64 43 SODA 50011 non-null float64 44 TCELL 50011 non-null float64 45 THYAO 50011 non-null float64 46 TKFEN 50011 non-null float64 47 TOASO 50011 non-null float64 48 TRKCM 50011 non-null float64 49 TSKB 50011 non-null float64 50 TTKOM 50011 non-null float64 51 TUKAS 50011 non-null float64 52 TUPRS 50011 non-null float64 53 USAK 50011 non-null float64 54 VAKBN 50011 non-null float64 55 VESTL 50011 non-null float64 56 YATAS 50011 non-null float64 57 YKBNK 50011 non-null float64 58 YUNSA 50011 non-null float64 59 ZOREN 50011 non-null float64 dtypes: float64(60) memory usage: 23.3 MB
from seaborn import heatmap
corr = df_wide.corr()
plt.figure(figsize=(40,40))
heatmap(corr, cmap="RdBu",annot=True)
plt.show()
'SAHOL', 'MGROS', 'HALKB', 'THYAO' and 'GARAN' are selected for further analysis.
pd.plotting.scatter_matrix(df_wide[['SAHOL', 'MGROS', 'HALKB', 'THYAO', 'GARAN']])
array([[<AxesSubplot:xlabel='SAHOL', ylabel='SAHOL'>,
<AxesSubplot:xlabel='MGROS', ylabel='SAHOL'>,
<AxesSubplot:xlabel='HALKB', ylabel='SAHOL'>,
<AxesSubplot:xlabel='THYAO', ylabel='SAHOL'>,
<AxesSubplot:xlabel='GARAN', ylabel='SAHOL'>],
[<AxesSubplot:xlabel='SAHOL', ylabel='MGROS'>,
<AxesSubplot:xlabel='MGROS', ylabel='MGROS'>,
<AxesSubplot:xlabel='HALKB', ylabel='MGROS'>,
<AxesSubplot:xlabel='THYAO', ylabel='MGROS'>,
<AxesSubplot:xlabel='GARAN', ylabel='MGROS'>],
[<AxesSubplot:xlabel='SAHOL', ylabel='HALKB'>,
<AxesSubplot:xlabel='MGROS', ylabel='HALKB'>,
<AxesSubplot:xlabel='HALKB', ylabel='HALKB'>,
<AxesSubplot:xlabel='THYAO', ylabel='HALKB'>,
<AxesSubplot:xlabel='GARAN', ylabel='HALKB'>],
[<AxesSubplot:xlabel='SAHOL', ylabel='THYAO'>,
<AxesSubplot:xlabel='MGROS', ylabel='THYAO'>,
<AxesSubplot:xlabel='HALKB', ylabel='THYAO'>,
<AxesSubplot:xlabel='THYAO', ylabel='THYAO'>,
<AxesSubplot:xlabel='GARAN', ylabel='THYAO'>],
[<AxesSubplot:xlabel='SAHOL', ylabel='GARAN'>,
<AxesSubplot:xlabel='MGROS', ylabel='GARAN'>,
<AxesSubplot:xlabel='HALKB', ylabel='GARAN'>,
<AxesSubplot:xlabel='THYAO', ylabel='GARAN'>,
<AxesSubplot:xlabel='GARAN', ylabel='GARAN'>]], dtype=object)
SAHOL has a linear relation with MGROS and GARAN.\ Between HALKB and THYAO, there seems to be a non-linear relation with a negative correlation (-55% from the previous chart).
Prices at day-close could be used for 30-day rolling correlation. Month-close would also be worth to assess. Let's arrange the data accordingly.
mask = df_wide.index.to_frame(index = False)
mask['date_str'] = mask['timestamp_dt'].dt.strftime('%Y-%m-%d')
mask['close'] = mask.groupby(['date_str'])['timestamp_dt'].rank(method="first", ascending=False)
mask.index = mask['timestamp_dt']
df_wide_close = df_wide[mask['close'] == 1]
df_wide_close['GARAN'].rolling(30).corr(df_wide_close['THYAO']).plot()
<AxesSubplot:xlabel='timestamp_dt'>
df_wide_close['SAHOL'].rolling(30).corr(df_wide_close['GARAN']).plot()
<AxesSubplot:xlabel='timestamp_dt'>
df_wide_close['SAHOL'].rolling(30).corr(df_wide_close['MGROS']).plot()
<AxesSubplot:xlabel='timestamp_dt'>
There are unusual twists in which "sudden" negative correlations occur. These fluctuations show that there may be special occasions, that have impact on stock prices.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
std_scaler = StandardScaler()
scaled_df = std_scaler.fit_transform(df_wide)
pca = PCA(n_components=3)
pca.fit_transform(scaled_df)
print(sum(pca.explained_variance_ratio_))
0.7873845938682055
from numpy import arange
nums = arange(16)
var_ratio = []
for num in nums:
pca = PCA(n_components=num)
pca.fit_transform(scaled_df)
var_ratio.append(sum(pca.explained_variance_ratio_))
pd.DataFrame(var_ratio)
| 0 | |
|---|---|
| 0 | 0.000000 |
| 1 | 0.508158 |
| 2 | 0.680285 |
| 3 | 0.787385 |
| 4 | 0.840039 |
| 5 | 0.877802 |
| 6 | 0.900099 |
| 7 | 0.914928 |
| 8 | 0.925468 |
| 9 | 0.934236 |
| 10 | 0.942141 |
| 11 | 0.949946 |
| 12 | 0.955847 |
| 13 | 0.961157 |
| 14 | 0.965944 |
| 15 | 0.970018 |
plt.figure(figsize=(10,5),dpi=150)
plt.grid()
plt.plot(nums,var_ratio,marker='.')
plt.xlabel('Components')
plt.ylabel('Explained Variance')
plt.title('')
Text(0.5, 1.0, '')
pd.DataFrame(pca.components_, columns=df_wide.columns)
| short_name | AEFES | AKBNK | AKSA | AKSEN | ALARK | ALBRK | ANACM | ARCLK | ASELS | ASUZU | ... | TTKOM | TUKAS | TUPRS | USAK | VAKBN | VESTL | YATAS | YKBNK | YUNSA | ZOREN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007884 | 0.109773 | 0.160242 | 0.064050 | 0.136022 | 0.005701 | 0.167446 | 0.121985 | 0.173918 | 0.148672 | ... | -0.014902 | 0.118653 | 0.167456 | 0.127119 | 0.091495 | 0.155937 | 0.170247 | -0.040827 | 0.090020 | 0.128302 |
| 1 | -0.163204 | -0.215956 | -0.037141 | -0.161634 | -0.086260 | -0.113737 | 0.053171 | -0.037718 | -0.002523 | -0.093909 | ... | -0.267312 | 0.064891 | 0.090344 | -0.146539 | -0.252506 | 0.029910 | -0.021797 | -0.276557 | -0.002676 | -0.101452 |
| 2 | -0.266361 | 0.073627 | 0.072012 | -0.150636 | -0.105521 | -0.213442 | -0.089468 | 0.251793 | 0.004175 | -0.130958 | ... | 0.109220 | 0.075144 | -0.008282 | 0.082747 | 0.020538 | 0.088376 | -0.054729 | -0.057006 | -0.261489 | 0.071375 |
| 3 | 0.012320 | -0.088794 | 0.081267 | -0.242351 | -0.127211 | 0.238619 | -0.090194 | -0.016968 | -0.092453 | 0.081723 | ... | 0.116449 | 0.123515 | -0.035836 | -0.182184 | -0.068728 | 0.127475 | -0.046387 | 0.016955 | 0.191121 | 0.179658 |
| 4 | 0.018392 | -0.114373 | 0.065223 | 0.222204 | -0.248843 | -0.086334 | -0.014453 | -0.139324 | 0.053475 | 0.110114 | ... | -0.051197 | -0.277873 | -0.067223 | 0.070254 | -0.109607 | -0.004283 | 0.009788 | -0.117374 | 0.172259 | 0.177518 |
| 5 | 0.013732 | 0.090825 | -0.204364 | 0.227431 | -0.138328 | 0.236182 | 0.100423 | -0.156592 | -0.066873 | -0.008327 | ... | 0.011679 | 0.245150 | 0.095994 | -0.051980 | 0.034947 | 0.131766 | -0.000133 | 0.107169 | -0.045460 | 0.129389 |
| 6 | 0.046402 | 0.104722 | -0.156250 | 0.075086 | -0.130244 | 0.355938 | 0.031930 | 0.117513 | -0.037452 | -0.078817 | ... | -0.093102 | -0.319589 | -0.007254 | -0.158582 | 0.104918 | 0.008767 | -0.130671 | 0.079387 | 0.081329 | 0.053083 |
| 7 | -0.062990 | -0.059549 | -0.084064 | -0.064799 | 0.002461 | 0.185296 | -0.049895 | -0.028142 | 0.163627 | -0.097042 | ... | -0.167614 | -0.000576 | 0.152095 | 0.194529 | -0.030297 | -0.241855 | -0.047618 | -0.152912 | -0.117002 | -0.123725 |
| 8 | -0.140147 | -0.005320 | -0.090982 | -0.122588 | 0.126794 | 0.015535 | -0.046909 | 0.008446 | -0.033480 | -0.033334 | ... | -0.099281 | 0.028172 | -0.041034 | -0.042117 | -0.010994 | 0.231828 | -0.010268 | 0.170796 | -0.073789 | 0.188342 |
| 9 | 0.496897 | 0.099233 | -0.103102 | -0.142824 | -0.057264 | -0.453259 | -0.044484 | -0.058300 | -0.010849 | -0.013321 | ... | -0.102889 | -0.123427 | -0.015317 | -0.169483 | -0.054636 | 0.014707 | 0.026023 | 0.123451 | 0.018873 | -0.085393 |
| 10 | 0.241188 | 0.041454 | -0.016686 | -0.103325 | 0.007911 | 0.055355 | -0.040887 | 0.003469 | 0.087258 | 0.158994 | ... | 0.012170 | -0.114385 | -0.065119 | -0.198119 | 0.035625 | 0.002418 | 0.276943 | -0.076269 | -0.327880 | 0.199756 |
| 11 | 0.397264 | -0.115214 | 0.229817 | -0.017247 | -0.041410 | 0.240661 | 0.013882 | 0.147506 | -0.030173 | -0.071267 | ... | 0.118626 | 0.156449 | 0.046077 | 0.022637 | 0.017031 | -0.086168 | 0.045617 | -0.104000 | -0.031493 | -0.332306 |
| 12 | 0.057513 | 0.091477 | 0.014362 | 0.089639 | 0.070061 | 0.225447 | 0.049025 | -0.009631 | -0.024883 | -0.008671 | ... | -0.044172 | 0.153855 | 0.022403 | -0.147228 | -0.052122 | -0.166953 | -0.096372 | -0.004817 | -0.140800 | -0.093225 |
| 13 | -0.024837 | 0.074048 | 0.031450 | -0.001343 | -0.079062 | -0.097576 | -0.087464 | -0.019965 | -0.065182 | -0.040868 | ... | 0.299080 | 0.319338 | 0.068809 | 0.099744 | -0.036524 | 0.050107 | -0.026607 | 0.120002 | -0.028513 | -0.044312 |
| 14 | 0.291867 | -0.177947 | 0.017385 | 0.167961 | 0.152875 | -0.125681 | 0.079799 | -0.157281 | 0.025311 | 0.199349 | ... | -0.199750 | 0.181231 | 0.028121 | 0.028060 | -0.091075 | 0.100254 | -0.121050 | -0.045223 | 0.056962 | -0.171290 |
15 rows × 60 columns
from numpy import max, abs
loadings = pd.DataFrame(pca.components_[0:4, :], columns=df_wide.columns)
maxPC = 1.01 * max(max(abs(loadings.loc[0:4, :])))
f, axes = plt.subplots(4, 1, figsize=(15, 10), sharex=True)
for i, ax in enumerate(axes):
pc_loadings = loadings.loc[i, :]
colors = ['C0' if l > 0 else 'C1' for l in pc_loadings]
ax.axhline(color='#888888')
pc_loadings.plot.bar(ax=ax, color=colors)
ax.set_ylabel(f'PC{i+1}')
ax.set_ylim(-maxPC, maxPC)
plt.tight_layout()
plt.show()
Most of the loadings of PC1 has the same sign. This component is a contrast in the direction of BAGFS, CCOLA, HALKB and SKBNK.\ As I cannot interpret the industries of stock, I couldn't conclude which industries were prevailing for each PC. Nevertheless, as far as I can google stocks, I may claim that PC2 is driven by banking sector, and PC4 has the movement opposite to electiricty/energy market.\ Another perspective may be the shareholder structure of the companies. Domestic capital, foreign capital and/or joint capital would add a value to analysis in order to reveal latent patterns.
I have selected a new set as GARAN, THYAO, ASELS, TTKOM to analyse with Google data.\ In addition to search volume of stocks, I would like to get Turkish Lira and TCMB search volumes over time. As I expect some correlation.\ Further, I'll get the search volume of the term of "war", "hisse yatirim" and "ekonomik kriz".
gtrends = pd.read_csv('gtrends.csv', sep = ';')
gtrends.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 82 entries, 0 to 81 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 yearmonth 82 non-null int64 1 TRY 82 non-null int64 2 GARAN 82 non-null int64 3 THYAO 82 non-null int64 4 ASELS 82 non-null int64 5 TTKOM 82 non-null int64 6 WAR 82 non-null int64 7 TCMB 82 non-null int64 8 HISSE_YATIRIM 82 non-null int64 9 ECO_CRISIS 82 non-null int64 dtypes: int64(10) memory usage: 6.5 KB
df_gt = df_wide[['GARAN', 'THYAO', 'ASELS', 'TTKOM']].reset_index(drop = True).copy()
df_gt['yearmonth'] = df_wide.index.to_frame(index = False)['timestamp_dt'].dt.year*100+df_wide.index.to_frame(index = False)['timestamp_dt'].dt.month
print(df_gt.info())
print(df_gt.head())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50011 entries, 0 to 50010 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 GARAN 50011 non-null float64 1 THYAO 50011 non-null float64 2 ASELS 50011 non-null float64 3 TTKOM 50011 non-null float64 4 yearmonth 50011 non-null int32 dtypes: float64(4), int32(1) memory usage: 1.7 MB None short_name GARAN THYAO ASELS TTKOM yearmonth 0 6.3715 3.3661 2.9948 4.2639 201209 1 6.3386 3.3574 2.9948 4.2521 201209 2 6.3386 3.3661 2.9855 4.2521 201209 3 6.3715 3.3748 2.9855 4.2521 201209 4 6.3715 3.3748 2.9760 4.2521 201209
df_gt_agg = df_gt.groupby('yearmonth')[['GARAN', 'THYAO', 'ASELS', 'TTKOM']].mean()
df_gt_comp = df_gt_agg.merge(gtrends, on = 'yearmonth', how = 'left', suffixes = ['_price', '_search'])
df_gt_comp.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 83 entries, 0 to 82 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 yearmonth 83 non-null int32 1 GARAN_price 83 non-null float64 2 THYAO_price 83 non-null float64 3 ASELS_price 83 non-null float64 4 TTKOM_price 83 non-null float64 5 TRY 82 non-null float64 6 GARAN_search 82 non-null float64 7 THYAO_search 82 non-null float64 8 ASELS_search 82 non-null float64 9 TTKOM_search 82 non-null float64 10 WAR 82 non-null float64 11 TCMB 82 non-null float64 12 HISSE_YATIRIM 82 non-null float64 13 ECO_CRISIS 82 non-null float64 dtypes: float64(13), int32(1) memory usage: 8.9 KB
df_gt_comp.index = df_gt_comp['yearmonth']
df_gt_comp.drop(columns = ['yearmonth'], inplace = True)
df_gt_comp.corr()
| GARAN_price | THYAO_price | ASELS_price | TTKOM_price | TRY | GARAN_search | THYAO_search | ASELS_search | TTKOM_search | WAR | TCMB | HISSE_YATIRIM | ECO_CRISIS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GARAN_price | 1.000000 | 0.487095 | 0.681801 | 0.440804 | 0.189476 | -0.591768 | 0.518645 | 0.501831 | -0.213711 | -0.073987 | 0.440271 | 0.511696 | -0.013251 |
| THYAO_price | 0.487095 | 1.000000 | 0.776196 | -0.266401 | 0.708602 | -0.601653 | 0.339088 | 0.762021 | 0.225743 | 0.012838 | 0.661255 | 0.579338 | 0.572691 |
| ASELS_price | 0.681801 | 0.776196 | 1.000000 | -0.122877 | 0.687123 | -0.904185 | 0.711621 | 0.897638 | 0.092118 | -0.113522 | 0.827285 | 0.722440 | 0.488478 |
| TTKOM_price | 0.440804 | -0.266401 | -0.122877 | 1.000000 | -0.544373 | 0.169204 | -0.123347 | -0.295553 | -0.467888 | 0.132264 | -0.408956 | -0.164994 | -0.524568 |
| TRY | 0.189476 | 0.708602 | 0.687123 | -0.544373 | 1.000000 | -0.649617 | 0.451095 | 0.696845 | 0.189140 | -0.062534 | 0.782933 | 0.506736 | 0.886001 |
| GARAN_search | -0.591768 | -0.601653 | -0.904185 | 0.169204 | -0.649617 | 1.000000 | -0.806698 | -0.753180 | -0.035518 | 0.134208 | -0.842710 | -0.686532 | -0.484149 |
| THYAO_search | 0.518645 | 0.339088 | 0.711621 | -0.123347 | 0.451095 | -0.806698 | 1.000000 | 0.542887 | 0.058637 | -0.186357 | 0.610799 | 0.620097 | 0.196061 |
| ASELS_search | 0.501831 | 0.762021 | 0.897638 | -0.295553 | 0.696845 | -0.753180 | 0.542887 | 1.000000 | 0.127954 | -0.038641 | 0.804261 | 0.622576 | 0.549421 |
| TTKOM_search | -0.213711 | 0.225743 | 0.092118 | -0.467888 | 0.189140 | -0.035518 | 0.058637 | 0.127954 | 1.000000 | -0.066593 | 0.205423 | 0.074203 | 0.159880 |
| WAR | -0.073987 | 0.012838 | -0.113522 | 0.132264 | -0.062534 | 0.134208 | -0.186357 | -0.038641 | -0.066593 | 1.000000 | -0.136062 | 0.034693 | 0.081840 |
| TCMB | 0.440271 | 0.661255 | 0.827285 | -0.408956 | 0.782933 | -0.842710 | 0.610799 | 0.804261 | 0.205423 | -0.136062 | 1.000000 | 0.613644 | 0.670010 |
| HISSE_YATIRIM | 0.511696 | 0.579338 | 0.722440 | -0.164994 | 0.506736 | -0.686532 | 0.620097 | 0.622576 | 0.074203 | 0.034693 | 0.613644 | 1.000000 | 0.363834 |
| ECO_CRISIS | -0.013251 | 0.572691 | 0.488478 | -0.524568 | 0.886001 | -0.484149 | 0.196061 | 0.549421 | 0.159880 | 0.081840 | 0.670010 | 0.363834 | 1.000000 |
The strongest correlation is between ASELS price and its search volume.\ The correlation with "hisse yatirim" is ASELS as it may imply ASELS took the attention of "investors" more than other selected stocks in the table.\ Search for "war" has weak correlations with stock prices.